-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delta Lake Catalog Exporter #7078
Conversation
change response of delta_exporter, remove prints, remove old impl
if ns == nil then | ||
error("failed getting storage namespace for repo " .. repo) | ||
end | ||
local delta = formats.delta_client("s3", lakefs_key, lakefs_secret, region) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fully in the context yet but I can't help not wonder already.
Is there a reason to depend on passing lakeFS credentials here?
Until now we worked with the policy that lua lakeFS client is using the context of the logged in user in the context during runtime to authenticate with lakeFS api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned F2F, this is due to the need to supply the delta-go package's client AWS key and secret to communicate with the S3 gateway.
On top of that, the administrator might configure a user that has permission to read the Delta table from its location such that every commit will export it even if the user that committed it doesn't have the right permissions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% 👍
Should document (in general I know not in this PR, what lakeFS permissions are required)
The technical problem reusing same technique as lakeFS lua client is because of the creds provider that expects credentials. 😢
CredsProvider: aws.CredentialsProviderFunc(func(ctx context.Context) (aws.Credentials, error) {
local function sortedKeys(query, sortFunction) | ||
local keys, len = {}, 0 | ||
for k,_ in pairs(query) do | ||
len = len + 1 | ||
keys[len] = k | ||
end | ||
if sortFunction ~= nil then | ||
table.sort(keys, sortFunction) | ||
else | ||
table.sort(keys) | ||
end | ||
|
||
return keys | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the sort function needed here, don't see its being used? If it's not needed let's remove it.
If you prefer to keep it then no need to check for nil:
local function sortedKeys(query, sortFunction) | |
local keys, len = {}, 0 | |
for k,_ in pairs(query) do | |
len = len + 1 | |
keys[len] = k | |
end | |
if sortFunction ~= nil then | |
table.sort(keys, sortFunction) | |
else | |
table.sort(keys) | |
end | |
return keys | |
end | |
local function sortedKeys(query, sortFunction) | |
local sort = sortFunction or | |
local keys, len = {}, 0 | |
for k,_ in pairs(query) do | |
len = len + 1 | |
keys[len] = k | |
end | |
table.sort(keys, sortFunction) | |
return keys | |
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in delta_exporter
.
Not sure I understand the suggested change, but I do agree that the nil check is redundant. 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant specifically the sortFunction
argument in the function, I don't see it being used anywhere?
If you prefer to leave it regardless, please add a comment on the sortFunction
expected signature otherwise it'll be hard to use it without going into the implementation of sortedKeys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sortFunction
is passed to table.sort
to apply the sorting logic (otherwise it uses default lexicographic sort). I don't need to pass, but since it's a util function, the API should suggest a way to sort keys.
I'll add a comment.
Lua refactoring Remove unneeded `listeningAddress` member from `DeltaClient`
lua-require packages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review WIP - if you prefer all at once LMK 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General review comment - what is the plan about tests?
-
Reference to glue and symlink exporter test integration in esti - can we add a test here?
-
Lua package unit testing - see reference hive_partition_pager.lua test
♻️ PR Preview 2b9f8a2 has been successfully destroyed since this PR has been closed. 🤖 By surge-preview |
…ient since the storage namespace can be fetched from the global action object
…ete key name (including ".json" suffix)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Isan-Rivkin, tests will be added on a different PR.
Thank you 🙏
…to the export_delta_log function
…sing the S3 gateway). Use regex to validate lakeFS address
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I left 2 comments un-resolved as reminders incase it's missed - feel free resolve.
Thank you for taking the time and addressing the comments!
Can't wait to export to Delta 🕺
Closes #6965
Changes
encoding/json
lua library to accept an optional options table. This is in order to pass an indentation if needed.listeningAddress
variable to the actions service. This is to allow hooks to communicate back to the running lakeFS server if they run on the same host, e.g. lua hooks.stat_object
andget_repo
operations.NOTICE 1
This PR depends on the merging of treeverse/delta-go#2.
NOTICE 2
Documentation will be added in a different PR.
How was this tested?
Manually tested on the following Delta Lake log constructs:
000000000000000000.json
).000000000000000004.json
and all other json entries up to and including checkpoint00000000000000000010.checkpoint.parquet
, and entry00000000000000000010.json
)00000000000000000010.checkpoint.parquet
, followed by a number of log entries).